CAGEF_services_slide.png

Lesson 6 - Linear Regression, Multiple Linear Regression, ANOVA, ANCOVA: Choosing the Best Model for the Job

xkcd_linear_regression.png

plot_confusion.png


0.1.0 About Introduction to R

Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R code by coding along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into R when required. This learning approach will allow you to spend your time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.

0.1.1 Where is this course headed?

We'll take a blank-slate approach here to R and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:

data-science-explore.png

0.1.2 How do we get there? Step-by-step.

In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.

Draw_an_Owl-2.jpg

The structure of the class is a code-along style: it is fully hands-on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.


0.2.0 Class Objectives

This is the sixth in a series of seven lectures. Last lecture we concluded our journey with data wrangling in the world of regular expressions. This week we will again explore that "clean" data that we've learned to generate, using the world of statistical models. At the end of this session you will be able to perform simple and multiple linear regression, one- and multi-way analysis of variance (ANOVA), and analysis of covariance (ANCOVA). You will be able to interpret the statistics that come out of these models, be cognizant of the assumptions they make, and use the F-test and Akaike Information Criterion (AIC) to select the best model for the job.

  1. Basic analysis of datasets and their distribution
  2. Linear regression models
  3. Generalized linear models
  4. Review
  5. Resources

0.3.0 A legend for text format in Jupyter markdown

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R

0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs and Recordings

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.


0.4.4 Data used in this session

The dataset we will use for this lesson is from the Summer Institute in Statistical Genetics (SISG) at the University of Washington's course in Regression and Analysis of Variance by Lurdes Inoue. This lesson uses a lot of material from the SISG 2016 course as well as conceptual material from Ben Bolker. I like this dataset because it has a number of categorical and continuous variables, which allows us to use the same dataset for many models. Also, the variables are familiar (age, BMI, sex, cholesterol), which makes data interpretation easier while we are in the learning stage.

We'll read the data in and take a look at the structure.

0.4.4.1 Dataset 1: SISG-Data-cholesterol.txt

A space-separated file consisting of measurements related to lipid health and haplotype data for variants at loci that may influence these values. We'll be using this dataset to explore and model relationships between genotype and phenotype.


0.5.0 Packages used in this lesson

The following packages are used in this lesson:

Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and load these packages. Remember to install these packages from the conda-forge channel of Anaconda (note that the BiocManager::install() line is run from within R, not from the conda prompt):

    conda install -c conda-forge r-biocmanager
    BiocManager::install("limma")
    conda install -c conda-forge r-gee
    conda install -c conda-forge r-multcomp


0.6.0 A little stats is a dangerous thing: an important aside about this lecture

I am not a statistician and likely have just enough knowledge to realize that I mostly know nothing about statistics. The tools and methods we discuss today are to familiarize you with the concept of creating simple linear models but you should always think deeply about your data and approach it with the right statistical toolset.

Before embarking on your journey, ask if your data meets all the criteria for this type of analysis. Read papers on similar subjects and see what the consensus is!

Lastly, don't get trapped on Mount Stupid of the Dunning-Kruger curve! We're aiming for the Valley! To despair... and beyond!

dunning-kruger-effect-curve.png


1.0.0 Answering questions with data

In order to work with our data we need to be able to answer some basic questions.

These are all really important questions that we may or may not think about as we try to dive in and get our answer as quickly as possible. Today we are going to slow down a bit and think about our data and our models.

Let's load our dataset today with a new function, read.delim(), where we will specify that the file is space-separated with a header.

Why use read.delim()?: Up until now we've mostly stuck to functions supplied by the readr package, but we're going back to the base R functions to use read.delim() because our data includes row names! If you look closely at the text file, the header row has one fewer entry than the data rows that follow. Unfortunately read_delim() does not handle this data format correctly, whereas read.delim() will!
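A self-contained sketch of this behaviour, using a tiny hypothetical file that mimics the layout of SISG-Data-cholesterol.txt (the header row has one fewer entry than the data rows, so base R treats the first column as row names):

```r
# Hypothetical mimic of the real file's layout: the header row has one
# fewer entry than the data rows, so read.delim() treats the unlabeled
# first column as row names.
tmp <- tempfile(fileext = ".txt")
writeLines(c("age chol",
             "p1 43 170",
             "p2 61 194"), tmp)

cholesterol <- read.delim(tmp, sep = " ", header = TRUE)
rownames(cholesterol)   # "p1" "p2"
str(cholesterol)

# The real call would look something like:
# cholesterol <- read.delim("SISG-Data-cholesterol.txt", sep = " ", header = TRUE)
```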

1.0.1 A brief breakdown of our data

This dataset maps genetic variants (single nucleotide polymorphisms, or SNPs) and their relationship to cholesterol (chol) and triglycerides (TG) for 3 loci: rs174548 (in FADS1 - an enzyme in fatty acid unsaturation), rs4775401 (a candidate SNP), and APOE (a major apolipoprotein important in Alzheimer's disease).

ID         patient code
sex        Male = 0, Female = 1
age        patient age
chol       cholesterol
BMI        body mass index
TG         triglycerides
rs174548   haplotype: C/C = 0, C/G = 1, G/G = 2
rs4775401  haplotype: C/C = 0, C/T = 1, T/T = 2
APOE       allele haplotype: e2/e2 = 1, e2/e3 = 2, e2/e4 = 3, e3/e3 = 4, e3/e4 = 5, e4/e4 = 6

1.0.2 Use exploratory plots to begin identifying relationships in your dataset

We are ultimately interested in the relationship between the above genetic variants and cholesterol, while controlling for factors such as age and sex. But let's get our feet wet by starting with the easier question: is there an association between mean serum cholesterol and age?

For this question cholesterol is the dependent variable, or the variable being measured. Age is the independent variable that we are changing to determine the effect on cholesterol.

It is always, always, always, a good idea to make an 'exploratory' plot of your data and get an idea of what its distribution looks like. We can start with a simple scatter plot of age and cholesterol.


1.0.3 Generate summary statistics to explore your data

Our data above looks pretty spread out along these axes. Let's generate some summary statistics to explore it further. We can describe our data with some basic summary statistics such as the mean, median, mode, min, max, standard deviation, and variance (central tendency and dispersion measures). R does not have a mode function relating to statistics, but we can figure it out for ourselves.
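A sketch of these summary statistics on toy values (the helper name stat_mode is hypothetical; R has no built-in statistical mode):

```r
# A small helper (hypothetical name) for the statistical mode:
# tabulate counts of each unique value and return the most frequent one
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(180, 185, 185, 190, 200)   # toy cholesterol-like values
mean(x)        # 188
median(x)      # 185
stat_mode(x)   # 185
sd(x); var(x); min(x); max(x)
```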


1.0.4 Make a density plot with geom_density to visualize your sample distribution

Our mean, median and mode are not that different, so our data is not strongly skewed in either direction. We can also check this for ourselves by making a quick density plot to visualize the residuals (each data point minus the mean).


1.0.5 Use quantile() to gauge the range of your distribution

Another way to dissect our distribution is by examining the boundary values of our dataset when we break it into various portions. We can accomplish this using the quantile() function, setting the proportions we want to generate.

Using the default behaviour of quantile() generates quartiles of our data which will also give us a good sense of the range and distribution of our data.
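A minimal sketch with toy values standing in for our cholesterol column:

```r
x <- c(12, 15, 18, 21, 24, 27, 30, 33, 36)   # toy values

quantile(x)                         # default: quartiles (0%, 25%, 50%, 75%, 100%)
quantile(x, probs = c(0.1, 0.9))    # custom proportions via the probs argument
```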


Going back to our question regarding if age is related to cholesterol, let's add the mean cholesterol to our plot for reference. This is done by adding geom_hline() to our plot and specifying the value for the yintercept.

From our plot, it looks like the mean might increase with age, but how do we test this?


1.1.0 Preparing data for a Student's t-test

T-tests are a simple statistical tool to compare pairs of means, i.e. between two groups at a time. We don't currently have age groups, but we can make them. One way to do this is to use our dplyr skills to create a new column 'age_group'. The data can be split at 55 years old (the midpoint of age in our data).

We can use an if/else statement (the ifelse function) to test: is age greater than 55? If the answer is 'yes', the value is 1, and if the answer is 'no', the value is 0. We can take a quick look at our dataset to make sure this worked.

Remember mutate(data, new_col = criteria_or_calculation)
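A sketch of that step on a toy stand-in for our dataset (shown in base R so it runs on its own; with dplyr loaded you would write the equivalent mutate() call, as in the comment):

```r
# Toy stand-in for our dataset; with dplyr loaded this would be
# chol_data <- mutate(chol_data, age_group = ifelse(age > 55, 1, 0))
chol_data <- data.frame(age = c(43, 61, 55, 70))
chol_data$age_group <- ifelse(chol_data$age > 55, 1, 0)
chol_data$age_group   # 0 1 0 1
```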


1.2.0 Does your data meet the minimum criteria?

A t-test is a fairly straight forward yet powerful test to determine if two means are statistically (significantly) different. However, there are some assumptions (criteria) that our data needs to meet in order to apply a t-test:

Before applying a t-test, let's make sure our data comes from a normally distributed population. From above, the density plot we generated suggests values are part of a normal distribution but let's check with some other tests.

1.2.1 Shapiro-Wilk test for normality

This is a common test for normality that essentially compares your data to a normal distribution and provides a p-value for the likelihood of the null hypothesis. The NULL hypothesis (H0) is that the samples came from a normal distribution.

The Shapiro-Wilk test requires sample sizes between 3 and 5,000. If your p-value $\leq$ 0.05, then you would reject the NULL (H0) hypothesis - suggesting your data is not from a normal distribution.
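A sketch with simulated data, contrasting a sample that is drawn from a Normal with one that clearly is not:

```r
set.seed(1)
normal_sample <- rnorm(200)   # drawn from a Normal distribution
skewed_sample <- rexp(200)    # exponential: clearly not Normal

shapiro.test(normal_sample)   # typically a large p-value: no evidence against H0
shapiro.test(skewed_sample)   # tiny p-value: reject H0 (not Normal)
```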


1.2.2 Histograms can simulate density

As we mentioned in our ggplot class from lecture 04, we can generate a quick histogram to look at how our data is roughly distributed. This is similar to a density plot, but uses the values of our data binned into groups. It may, however, look quite familiar to you already.


1.2.3 Quantile-quantile plots

What is a quantile-quantile plot (Q-Q plot)?

It's a scatter plot. More specifically, it's a scatter plot of your values (sorted in ascending order) against a sample size-matched number of quantiles from a chosen theoretical distribution. There are multiple ways to generate a Q-Q plot in R but here we will use:
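One base-R option, sketched here with simulated values standing in for our cholesterol column:

```r
set.seed(42)
x <- rnorm(150, mean = 185, sd = 20)   # toy cholesterol-like values

# Sorted sample values plotted against matched theoretical Normal quantiles
qqnorm(x)
# Reference line; points falling along it suggest normality
qqline(x)
```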

What do we want to see?

Essentially a straight line. Depending on the shape and tails, you may need to transform your data to match a normal distribution. If that doesn't seem possible, then you really should not continue with a parametric test like the t-test.

More about Q-Q plots: There are many different distributions that can be identified by the outline of their Q-Q plot and recognizing these can help you decide if you need to take further steps to transform your data (See the Appendix for more details). You can also visit some helpful posts from around the web to help clarify these ideas.

1.2.4 Homoscedasticity looks at the variance of "noise" in our data

Also written as homoskedasticity, this relates to the variance within each group. If we generate a regression model, is the residual, or error term, constant? If instead the error term differs significantly across the independent variable (i.e., age_group), then the model is missing a large effect from another independent variable.

We'll use Bartlett's test with bartlett.test() which essentially tests if k groups have equal variances and returns a K-squared value and p-value against our null hypothesis (H0 = the groups have the same variance). The call we will use takes the form:

    bartlett.test(formula = values ~ grouping, data = dataset)

where values ~ grouping means to use the data from the variable values and to group it by grouping when retrieving from dataset.

Afterwards, we will compare the output of our test against a $\chi^{2}$ value with probability $(1-\alpha)$ and $df = k-1$ using qchisq().

A quick warning: this test assumes normality for your samples - if this is not the case, a rejection of H0 may be a rejection of normality instead!
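A sketch of both steps on toy data standing in for cholesterol split by age_group:

```r
set.seed(3)
# Toy data standing in for cholesterol split into two age groups
vals  <- c(rnorm(50, mean = 180, sd = 20), rnorm(50, mean = 188, sd = 22))
group <- factor(rep(c(0, 1), each = 50))

bt <- bartlett.test(vals ~ group)   # H0: the groups have equal variances
bt$statistic                        # Bartlett's K-squared
bt$p.value

# Critical chi-squared value at alpha = 0.05 with df = k - 1 = 1
qchisq(0.95, df = 1)   # ~3.84; a K-squared below this fails to reject H0
```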


1.3.0 Visualize our groups of data further by boxplot

In our case, from testing above, the data follows a normal distribution and although the variance between groups is not exactly identical, it is still homoscedastic! We can now use a boxplot to look at the distribution of cholesterol for our 2 groups.

Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend from the 1st and 3rd quartiles to the closest values that are no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.


1.4.0 Finally, a t-test to compare our distributions

There seems to be a lot of overlap in our cholesterol values. How do we tell if the means are truly different?

Let's think about this a little more explicitly:

The null hypothesis (H0) is that there is no difference in the sample means between our groups.

An alternative hypothesis (H1) is that there is a difference between the means (2-sided test), or that the difference in means is greater or lesser than zero (1-sided test).


$\alpha$ is our significance level (the threshold we compare our p-value against), $\mu$ is the population mean, and k is our sample mean. Remember that we are estimating the true population mean using the sample that we have. Our p-value is the probability of finding our observed value by chance given that the null hypothesis is true.

We will use a simple Student's t-test to test the alternative hypothesis that the true difference in means is not equal to 0.

The t.test function takes as input the variables on which we are performing the test (as a vector), the "type" of t-test being performed ("two.sided", "less", or "greater"), and the confidence interval. The confidence interval is the interval that will cover the true parameter x% of the time. In the image above the confidence interval covers the pink area, (1-$\alpha$).

You can alternatively enter your variables as a formula, in this case y ~ x. The ~ in the R language separates the left and right sides of a formula. (You can run a 1-sided t-test by specifying alternative = 'greater' or alternative = 'less'.) In this case, alternative = 'two.sided' and conf.level = 0.95 are the default parameters and are only included for clarity. For now we are assuming equal variance holds, based on our previous tests.
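A sketch of the call on simulated stand-ins for our two age groups (variable names are hypothetical):

```r
set.seed(7)
# Toy cholesterol values standing in for our two age groups
chol_young <- rnorm(100, mean = 180, sd = 22)   # ages 30-55
chol_old   <- rnorm(100, mean = 188, sd = 22)   # ages 56-80

res <- t.test(chol_young, chol_old, alternative = "two.sided",
              conf.level = 0.95, var.equal = TRUE)
res$p.value    # probability of a difference this large under H0
res$conf.int   # 95% CI for the difference in means

# The formula interface on a data frame would look like:
# t.test(chol ~ age_group, data = chol_data, var.equal = TRUE)
```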


1.4.1 Interpreting our t-test

Our output tells us the mean cholesterol for those aged 30-55 is 179.98 mg/dl and the mean for those aged 56-80 is 187.89 mg/dl. The difference in means is significant at a p-value of 0.0003146 (< 0.05).

So we now know there is a positive relationship between cholesterol and age. However the t-test has limitations. What is the magnitude of this relationship during aging? Does it change by approximately the same amount per year? What if we don't want to break our data into groups?

For the answer to this, we'll need to learn a new toolset.


2.0.0 Linear regression models of our dataset

Screenshot_2018-04-23_09-03-16.png

Source: "All (or most) of statistics". Bolker, 2007.

There are a ton of models (or families of models) out there for different statistical purposes and with different assumptions. These assumptions, if violated, can give incorrect predictions. However, we might not know if these assumptions are true when selecting our model. Today we are hanging out in the top left corner, and we are going to learn the assumptions of linear models in general, the specific models we will be using today, and an example of each. We will trouble-shoot when assumptions fail later in the lesson.


2.1.0 Assumptions of general linear models

1. Observed values are independent of each other (independence)

2. Variation around expected values (residuals) are normally distributed (normality)

Screenshot%20from%202020-07-30%2011-23-26.png (James et al. (2017). An Introduction to Statistical Learning)

y_dist.png

3. constant variance, homoscedastic (equal variance)

homo.png

4. observed values (y) are related by linear functions of the parameters to x (linearity)

Assumptions 2 and 3 are often grouped together.


2.2.0 Linear Models

For simple linear regression we are modeling a continuous outcome by a single continuous variable. Example: modeling cholesterol using BMI.

For multiple linear regression we are modeling a continuous outcome by more than one continuous variable. Example: modeling cholesterol using BMI and age. In this case, we must consider whether there is an interaction between age and BMI on cholesterol (more on interactions to follow).

For one-way ANOVA we are modeling a continuous outcome by a single categorical variable. Example: modeling cholesterol by sex. It is important that categorical variables are explicitly input as factors to be interpreted properly in the model. For example, since we have encoded sex as 0 and 1 (instead of 'M' and 'F'), we need to specify that sex is to be treated as ordinal and not as a number. Therefore we specify sex as a factor of 2 levels, 0 and 1.

For multi-way ANOVA we are modeling a continuous outcome by more than one categorical variable. Example: modeling cholesterol by sex and APOE genetic variants. Again, we need to consider any interaction between our categorical variables, and we need to specify our numeric values to be treated as categorical variables and not numbers. APOE will be a factor of 6 levels, one for each genetic variant.

Lastly, for ANCOVA we are modeling a continuous variable by a combination of categorical AND continuous variables. Example: modeling cholesterol using the genetic variants of APOE and BMI. Again, our categorical variable must be input as a factor. ANCOVA allows for each group (each genetic variant of APOE in this example) to have a separate slope.

This is a summary table you might find helpful for choosing a model based on the data types you have, and the assumptions you are making. I hope to show that model selection is akin to going through mental checklist for your data, and not that scary. The independence assumption is required for all the models below, and is not included in the chart for spacing reasons.

model categorical continuous linearity normality homoscedasticity
simple linear regression X $\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$
multiple linear regression X $\checkmark\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$
one-way analysis of variance (ANOVA) $\checkmark$ X $\checkmark$ $\checkmark$ $\checkmark$
multi-way ANOVA $\checkmark\checkmark$ X $\checkmark$ $\checkmark$ $\checkmark$
analysis of covariance (ANCOVA) $\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$
nonlinear least squares X $\checkmark$ X $\checkmark$ $\checkmark$
nonlinear ANCOVA $\checkmark$ $\checkmark$ X $\checkmark$ $\checkmark$
generalized linear models $\checkmark$ $\checkmark$ X* X X

*restricted cases


2.2.1 Revisiting our question

What is the relationship between cholesterol and age?

Now, we can pick a model to answer our question by considering the assumptions above instead of a t-test.

If we evaluate our independent and dependent variables, age and cholesterol, they are both continuous, not categorical. We only have one independent variable. From the plot we made earlier (repeated below) it looks like if there is a relationship between age and cholesterol it would be linear.

Based on the above criteria, we will try using a simple linear regression to test the association between mean serum cholesterol and age.


2.3.0 Simple linear regression made simple with ggplot()

What we are looking for, is the slope of the line relating cholesterol to age, which will tell us the magnitude and direction of the relationship between these variables. We can look at the slope for the linear model that ggplot() would fit for us for an idea of what our model will look like.


2.3.1 Review: the equation for a straight line.

To make sense of what we're seeing on our plot, let's review the equation for a straight line:

\begin{equation*}\Large Y \sim Normal(a + bx, {\sigma^2}) \end{equation*}

y is our dependent variable that we are attempting to model. x is our independent variable.
a is the intercept (the value of y where x = 0; where x crosses the y-axis).
b is the slope of the line (the change in y corresponding to a unit increase in x).
Normal is telling us that our error is normally distributed.
$\sigma^2$ is the variance of the normally distributed error (the spread of the y values around the line)

Slopes

The interpretation in our example is that the slope is the difference in mean serum cholesterol associated with a one year increase in age.

With a straight line we are not, of course, plotting through all of our points, but rather close to the mean of an outcome in y as a function of x. For example, there are only six values of cholesterol for the 50 year-old age group, and our line will fall somewhere close to the mean of these values. Values of y have a distribution at a given x, which we have assumed is normally distributed.

Lastly, in this equation we also have some normally distributed error. Sampling error exists in our estimates because different samples give different means.

How do we actually find the best fitting line? We use least squares estimation, which minimizes the sum of squares of the vertical distances from the observed points to the least squares regression line ($y -\hat{y}$) .

$y$ - observed value
$\hat{y}$ - estimated value
$\bar{y}$ - sample mean

least_squares.png



Let's run this simple linear regression. Using R, the intercept and slope terms are implicit.

R code:

lm(y ~ x)

There are times when the intercept, the value of y at x = 0, doesn't make much intuitive sense for interpreting our data. To force the intercept to zero (y = bx) and make relative comparisons instead, use lm(y ~ x - 1).

As we are used to writing equations, our dependent variable (cholesterol) is on the left side of the lm formula and our independent variable (age) is on the right side; the tilde ~ separates these sides. We also input the dataset to the lm function.

The function will output our formula, the slope and the intercept. However, if we save the output of the function into an object, 'fit', we get a list object of the model, the input, and all associated statistics. We can look at a summary() and get residuals, errors, p-values and more in addition to our coefficients.
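A sketch of the fit on simulated data built to echo the relationship described in the lesson (a slope near 0.31 mg/dl per year; the values themselves are invented):

```r
set.seed(5)
# Simulated stand-in: cholesterol rising ~0.31 mg/dl per year of age
age  <- runif(100, min = 30, max = 80)
chol <- 167 + 0.31 * age + rnorm(100, sd = 30)

fit <- lm(chol ~ age)   # intercept and slope terms are implicit
coef(fit)               # (Intercept) and age coefficients
summary(fit)            # residuals, std. errors, p-values, R-squared
```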


2.3.2 Use summary() to look at the salient information about your model

As you can see from looking at the structure, the lm model object holds a lot of information. We can use the summary() function to pull out some of the more important information about our model instead of wading through all of the data. It should be noted that this information can vary between different types of objects as well.


2.3.3 Interpreting our linear model

The intercept is 166.9 and the slope is 0.31. What does that actually mean? It means a baby (age 0 years) would be expected to have, on average, a serum cholesterol of 166.9 mg/dl. For every yearly increase in age, mean serum cholesterol is expected to increase by 0.31 mg/dl. These results are significant with a p-value < 0.001. We can reject the null hypothesis and say that mean serum cholesterol is significantly higher in older individuals. The Adjusted R-squared value tells us that about 4% (0.03858) of the variability in cholesterol is explained by age.

Note there are two sets of p-values presented: Pr(>|t|), and a p-value associated with the F-statistic.

2.3.4 Use confint() to generate confidence intervals on your data

Given that we have a model object like our ageFit linear model, we can generate a confidence interval using the confint() method. By default this function sets the argument level to 0.95 or a 95% confidence interval level. Let's generate a confidence interval for our model parameters.

We can further get confidence intervals for these values to say that 95% of the time we expect the cholesterol of a baby to fall within 158.5-175.3 mg/dl, or that we are 95% confident that the difference in mean cholesterol associated with a one year increase in age is between 0.16 and 0.46 mg/dl.

Be precise about confidence intervals! Just a reminder that a 95% confidence interval does not mean there is a 95% probability that the true population value (i.e., the mean) lies within our range. We don't know the true population value or how close it is to our estimated value; it is either inside or outside the interval. More precisely, the correct interpretation is that if we were to randomly sample from our population multiple times, we expect that 95% of our constructed intervals would contain the true population value. Therefore we can say with 95% confidence that our calculated interval contains the true population value. It's a very fine distinction. On the bright side, we can also note that a wide interval suggests a less precise estimate than a narrow interval. Of course, the size of our interval is influenced by our sample size, with narrower intervals generated by higher sample numbers. Likewise, using a smaller confidence level (e.g., 90%) produces a narrower interval as well.

drake_confidence.jpg


2.4.0 Multiple linear regression

In multiple linear regression we use multiple continuous independent variables to predict outcome values. Additional terms can be added in 2 ways.

2.4.1 Adding powers of a variable (polynomial regression)

It was mentioned before that the 'linear' part of linear regression refers to a linear function of the parameters, not of the independent variables. In the example below, the parameters $a$, $b_1$, and $b_2$ are linear even though the independent variable has a quadratic component, $x^2$. An example of this could be synthesizing a chemical, where synthesis speeds up along an increasing curve as temperature rises.

Expression:

\begin{equation*}\Large Y \sim Normal(a + b_1x + b_2x^2, {\sigma^2}) \end{equation*}

If we were to write this in R, again our intercept and coefficients are implicit. To write the quadratic term we use the function I which just means 'as.is'. This allows the interpreter to see the ^ as a power symbol and not as an interaction term.

R code:

lm(y ~ x + I(x^2))

Suppose from our data that we think cholesterol can be modeled by age as a quadratic equation. What would that look like?
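A sketch of the quadratic fit on simulated data (the coefficients in the simulation are invented for illustration):

```r
set.seed(9)
age  <- runif(150, min = 30, max = 80)
chol <- 160 + 0.6 * age - 0.003 * age^2 + rnorm(150, sd = 25)  # simulated curve

# I() means 'as-is': ^ is read as a power, not as a formula operator
quadFit <- lm(chol ~ age + I(age^2))
coef(quadFit)                    # a, b1 (age), b2 (age^2)
summary(quadFit)$adj.r.squared   # how much variation the quadratic model explains
```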

It looks like this modification to our model explains slightly more of the variation (Adjusted R-squared) but doesn't drastically improve on it.


2.4.2 Adding extra continuous variables to our model

We are interested in improving our model by adding an extra variable we think might have an effect on our outcome values. In the example below, we add the independent variables x1 and x2, and each term has its own linear parameter, b1 and b2 respectively.

Expression:

\begin{equation*}\Large Y \sim Normal(a + b_1x_1 + b_2x_2, {\sigma^2}) \end{equation*}

This is the model we will be using next. To aid with interpretation let's think about a parameter: b2 is the expected change in the mean outcome per unit change in x2 when x1 is held constant (sometimes called controlling for x1).

The null hypothesis in this case is that b1 = b2 = 0. The alternative hypothesis is that at least one of these parameters is not zero.

Again, in R the intercept and coefficients are implicit in the lm function.

R code:

lm(y ~ x1 + x2)

We know that age has an effect on cholesterol. With our new model we want to ask the question: Is there a statistically significant relationship between mean serum cholesterol and age after controlling for BMI? Let's look graphically at these relationships to help us understand our model. First let's plot BMI vs cholesterol. We can add a linear fit to make sure we are expecting a positive slope.

We should also take a look at the relationship between BMI and age.


2.4.2.1 Building our multiple linear regression model

From our analysis, we see that cholesterol increases with BMI. Furthermore, we see that BMI tends to increase with age. We will look at the association of age and cholesterol while holding BMI constant. This will indicate whether the significance of our previous relationship between increasing cholesterol and increasing age was affected by BMI.
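A sketch of the two-variable fit on simulated stand-ins that echo the relationships we just plotted (all coefficients in the simulation are invented):

```r
set.seed(11)
# Simulated stand-ins: BMI drifts up with age, cholesterol depends on both
age  <- runif(200, min = 30, max = 80)
BMI  <- 20 + 0.08 * age + rnorm(200, sd = 3)
chol <- 137 + 0.2 * age + 1.4 * BMI + rnorm(200, sd = 28)

ageBMIFit <- lm(chol ~ age + BMI)   # age effect while holding BMI constant
summary(ageBMIFit)$coefficients     # estimates, std. errors, t values, Pr(>|t|)
```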


2.4.3 Interpreting our multiple linear regression results

Our equation would now look like $y = 137.16 + 0.20age + 1.43BMI$.

The estimated increase in mean serum cholesterol per year of age, holding BMI constant, is 0.20 mg/dl. This increase is less than our previous value of 0.31 mg/dl. Why do the estimates differ?

Before, we were not controlling for BMI. Our estimate of the age-associated increase in mean cholesterol now applies to subjects with the same BMI, not to subjects of all BMIs.

It looks like both age and BMI are significant. From the Adjusted R-squared value, it looks like we now explain nearly 7% of our variability. Still, we might want to verify whether adding BMI actually made a difference to the model.
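A hedged sketch of building both models (the object names `ageFit` and `ageBMIFit` follow the lesson; the data frame is a simulated stand-in for the course's dataset):

```r
# Simulated stand-in data; the real course data are read in from file
set.seed(1)
cholesterol <- data.frame(age = sample(30:70, 400, replace = TRUE))
cholesterol$BMI  <- 20 + 0.1 * cholesterol$age + rnorm(400, sd = 3)
cholesterol$chol <- 140 + 0.2 * cholesterol$age + 1.4 * cholesterol$BMI +
                    rnorm(400, sd = 20)

ageFit    <- lm(chol ~ age, data = cholesterol)        # age only
ageBMIFit <- lm(chol ~ age + BMI, data = cholesterol)  # age, controlling for BMI

summary(ageBMIFit)  # coefficients fill in y = a + b1*age + b2*BMI
```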


2.5.0 Comparing models with anova()

We can compare our two models with the anova() function. The output of our models, ageFit and ageBMIFit, are lm objects. Given 2 lm objects, the anova() function tests the models against one another to see whether the additional terms significantly improve the fit, and prints these results in an analysis of variance table.

Given a single lm object, it will test whether the terms of that model are significant - we will use the function in this format later.


2.5.1 Interpreting the result of our comparison by the F-value

Our second model is significantly different from our first model. What is this significance based on?

The significance is a probability based on an F-test. While a t-test tells you whether two means (from a single variable, in our case) are significantly different, an F-test tells you whether a group of variables is jointly significant.

The F-test compares the joint effect of all variables together; a large F-value means 'something' is significant. We can calculate the F-value ourselves from the information provided in the table.

Model Res.df RSS df Sum of Squares F Pr(>F)
0 398 187187.5 NA NA NA NA
1 397 180841.7 1 6345.76 13.93079 0.0002173567


Note that the Residual Sum of Squares (RSS) is also known as the Sum of Squared Errors (SSE). Recall that this is based on the difference between our data and our model estimates for the same data point, $(y - \hat{y})^2$. This represents the total variation/difference between our observed data and the model we have generated.

modelVar.vs.residualVar.png https://stats.stackexchange.com/questions/172157/how-to-interpret-an-anova-table-comparing-full-vs-reduced-ols-models

The residual degrees of freedom equal the total sample size minus the number of parameters estimated by the model. Note that cholesterol has 400 entries: the age-only model estimates 2 parameters (intercept and slope), leaving 398 degrees of freedom, and adding BMI leaves 397.


2.5.2 Basic formulas for calculating F-value


\begin{equation*} \begin{aligned} SS_\Delta &= RSS_{0} - RSS_{1} &\text{The difference in errors generated by our null (0) and new (1) models}\\[10pt] SS_\Delta &= \text{Sum of Squares}_{1} &\text{The further apart our models are, the larger this will be}\\[10pt] MS_\Delta &= \frac{SS_\Delta}{\text{Res.df}_\Delta} &\text{The mean square for the difference between models}\\[10pt] MS_{R1} &= \frac{RSS_{1}}{\text{Res.df}_{1}} &\text{The mean square for the residuals of our test model}\\[10pt] F &= \frac{MS_\Delta}{MS_{R1}} &\text{The ratio of the variation between models and residuals of the new model} \end{aligned} \end{equation*}


Let's see those model comparisons again:

Model Res.df RSS df Sum of Squares F Pr(>F)
0 398 187187.5 NA NA NA NA
1 397 180841.7 1 6345.76 13.93079 0.0002173567

2.5.3 Every F-statistic is an F-value with an accompanying p-value

So from some of the values supplied by anova() we can confirm the correct calculation of the F-value.

Note: an F-test is not used alone because you still need a p-value to find out 'what' is significant. This p-value is part of the F-statistic and is calculated as follows:

pf(F-value, Res.df-delta, Res.df1, lower.tail=FALSE).

The pf() function evaluates an F-distribution with the supplied degrees of freedom and finds the area under the curve in the upper tail, starting at the supplied F-value.
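We can reproduce the F-value and p-value from the table above by hand, using only base R arithmetic and the pf() call just described:

```r
# Values taken from the anova() comparison table above
RSS0 <- 187187.5   # residual sum of squares, age-only model (Res.df = 398)
RSS1 <- 180841.7   # residual sum of squares, age + BMI model (Res.df = 397)

SS_delta <- RSS0 - RSS1        # ~6345.8, the Sum of Squares column
MS_delta <- SS_delta / 1       # divided by the difference in Res.df (398 - 397)
MS_R1    <- RSS1 / 397         # mean square of the residuals of the new model

F_value <- MS_delta / MS_R1    # ~13.93, matching the table's F column
p_value <- pf(F_value, 1, 397, lower.tail = FALSE)  # ~0.000217, the Pr(>F) column
```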

From that calculation we see that Pr(>F) is smaller than our $\alpha$ of 0.05.

Put simply, this result suggests that by increasing the complexity of our model to account for BMI, we have succeeded in improving the fit to our data, since we already know from above that a higher percent of variance is explained in ageBMIFit compared to ageFit.


2.5.4 Comparing models with AIC()

The Akaike Information Criterion (AIC) is another method for model selection. The AIC balances the trade-off between the complexity of a given model and its goodness of fit (how well the model fits the data), where the best model is the one that provides the maximum fit with the fewest predictors. Its interpretation is fairly straightforward: the lower the score, the better a model is relative to the other models. If many models have similarly low AICs, you should choose the one with the fewest model terms (variables or parameters).

Given that ageBMIFit has a lower score, we can say that it is the best model of the two evaluated.
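A minimal sketch of the AIC comparison on simulated stand-in data (the real ageFit and ageBMIFit come from the course's dataset):

```r
# Simulated stand-in data so the example runs on its own
set.seed(7)
d <- data.frame(age = sample(30:70, 400, replace = TRUE))
d$BMI  <- 20 + 0.1 * d$age + rnorm(400, sd = 3)
d$chol <- 140 + 0.2 * d$age + 1.4 * d$BMI + rnorm(400, sd = 20)

ageFit    <- lm(chol ~ age, data = d)
ageBMIFit <- lm(chol ~ age + BMI, data = d)

AIC(ageFit, ageBMIFit)  # the model with the lower AIC is preferred
```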


2.6.0 Adding interaction terms to our model

What is meant by an interaction?

There is an interaction if the association between the response and the predictor variable changes across the range of the new variable. This can be seen in the expression below, where the effect of a unit change in x1 on the mean response changes by b3 for each unit difference in x2 (and vice versa); i.e. the slope of x1 changes with x2, through b3.

In our case study, an interaction refers to how BMI (the third variable) affects the relationship between cholesterol and age.

Expression:

\begin{equation*}\Large Y \sim Normal(a + b_1x_1 + b_2x_2 + b_3x_1x_2, {\sigma^2}) \end{equation*}


In the graph below, there is an interaction between education and ideology. The slope, indicating the probability that people will care if sea level rises 20 feet, changes with each education level and each shift in ideology. If there were no interaction between ideology and education, the slopes shown would be parallel.

GSS_sealevel_interaction.png



When testing for an interaction between 2 input variables, lm() takes an asterisk * instead of a + between the independent variables.

R code:

lm(y ~ x1 * x2)

An interaction is different from a confounding factor, for which the association between the response and predictor variable is constant across the range of the new variable. You can think of confounding factors as variables that have an effect on the outcome but haven't been accounted for.

For example, in our first model, where the increase in cholesterol was attributed ONLY to an increase in age, BMI would be a confounding factor because weight contributes significantly to an increase in cholesterol, and age alone is not responsible for the increase in cholesterol.


2.6.1 Challenge: include an interaction term within our model of cholesterol prediction

Test if there is an interaction between age and BMI in a model predicting mean serum cholesterol. Note that these are both continuous variables.

  1. Is the interaction significant?
  2. Is there a difference between this model and the model with age as the only variable?
  3. Is there a difference between this model and the model of BMI and age with no interaction?

From our analyses above, the comparison suggests that our ageBMIFit model is not significantly different from the interaction model, ageIntBMIFit. This is confirmed by the results of the Akaike Information Criterion as well.

Crossing versus Interactions: While in our examples above we used the * (asterisk) as an operator for interaction, it is actually denoted as crossing in our formulas. This means for the formula y ~ a*b we are saying that y is a function of a, b, and their interaction. Sometimes you may wish to specify only an interaction using the colon ":" in a formula such as y ~ a + a:b, which defines y as a function of a and the interaction between a and b, but does not account for the independent variable b itself.
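A quick sketch on simulated data (the variables a, b, and y are hypothetical) showing that the crossed and fully expanded formulas fit the identical model, while the colon-only form drops the main effect of b:

```r
# Hypothetical data with a genuine interaction built in
set.seed(3)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$y <- 1 + d$a + d$b + 0.5 * d$a * d$b + rnorm(100)

m_cross <- lm(y ~ a * b, data = d)        # shorthand for a + b + a:b
m_long  <- lm(y ~ a + b + a:b, data = d)  # identical model, written out
m_part  <- lm(y ~ a + a:b, data = d)      # interaction without b's main effect

all.equal(coef(m_cross), coef(m_long))  # TRUE: same coefficients
length(coef(m_part))                    # 3: intercept, a, a:b (no b term)
```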

3.0.0 Other general linear models for categorical independent variables

Scooby_doo_GLM.jpg

Up to now we've been comparing the effect of age and BMI on cholesterol levels but these are continuous variables. Remember we have also included haplotypes for three genetic loci in our dataset. How do we build a model analysing the influence of these independent variables on our outcome (cholesterol)?

The suite of general linear models includes many statistical models, which we will continue to discuss in the context of our data. Until now, we have focused on the subgroup of regression models, in which all of our independent variables are quantitative. The remainder of our models, while still being general linear models, include one or more categorical independent variables.

Recall our data tracks a number of genetic variants and codes the genotype of observations for each:

Variable    Description
ID          patient code
sex         Male = 0, Female = 1
age         patient age
chol        cholesterol
BMI         body mass index
TG          triglycerides
rs174548    haplotype: C/C = 0, C/G = 1, G/G = 2
rs4775401   haplotype: C/C = 0, C/T = 1, T/T = 2
APOE        allele haplotype: e2/e2 = 1, e2/e3 = 2, e2/e4 = 3, e3/e3 = 4, e3/e4 = 5, e4/e4 = 6

3.1.0 One-way analysis of variance (ANOVA)

In the analysis of variance (ANOVA) all independent variables are categorical (factors) rather than continuous. This allows us to ask the question:

Does the genetic factor rs174548 have an effect on cholesterol levels?

Our categorical variable is represented by $\alpha$, and $i$ represents the levels of our factor.

Expression:

\begin{equation*}\Large Y \sim Normal({\alpha_i}, {\sigma^2}) \end{equation*}

We still use the lm() function, however we replace our continuous variable with f, a categorical variable (factor). If your data is character type, R will automatically make a factor for you. However, if your data is numeric, R will interpret it as continuous. In this case, you need to make your numeric data a factor first using factor().

R code:

lm(y ~ f)

R parameterizes the model in terms of the differences between the first group and subsequent groups (ie. relative to the first group) rather than in terms of the mean of each group. This is similar to the interpretation of the previous linear models. You can instead fit the means of each group using: lm(y ~ f-1), similar to how we force the y-intercept to 0.
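A self-contained sketch of the difference between treating a numeric code as continuous versus as a factor (the column names are hypothetical stand-ins for the course data):

```r
# Simulated genotype codes 0/1/2 and a cholesterol-like outcome
set.seed(5)
d <- data.frame(geno = sample(0:2, 120, replace = TRUE))
d$chol <- 180 + 5 * (d$geno == 1) + rnorm(120, sd = 10)

wrong <- lm(chol ~ geno, data = d)              # geno wrongly treated as continuous
right <- lm(chol ~ factor(geno), data = d)      # contrasts relative to first level
means <- lm(chol ~ factor(geno) - 1, data = d)  # fits one mean per group instead

length(coef(wrong))  # 2: intercept + a single slope
length(coef(right))  # 3: intercept + two group contrasts
```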

To begin the journey in answering our question, we first plot the relationship between rs174548 and cholesterol.


Why are the levels of our variable rs174548 set to NULL? Looking at the structure we see that this variable is listed as an int and is not yet a factor - so no levels exist.

Recall that we can coerce from integers and character variables into factors by casting them directly as we see below. We can do the same with variables, on the fly, as we plot our data into a scatterplot and boxplot.


Hard to see the distribution of our factors right? It looks all very mixed up. Perhaps a boxplot will clarify this data?


From our preliminary plots, we see that our genetic factor has 3 groups, and we will be comparing the means for each of these groups. These groups have high variance, and there is a good deal of overlap between them.

factor() versus as.factor(): We've seen through our code that most data structures or objects can be cast in one of two ways: object() or as.object(). So which is better when casting objects? Usually the former is used in the basic construction of an object, when first instantiating it. There are additional steps "under the hood" that are optimized for initializing a data frame or factor; in the case of factor(), this includes going through all of the unique values, sorting them, and identifying unused levels. On the other hand, as.factor() is slightly optimized for conversion of integers and pre-existing factor objects. When casting a factor back to a factor, this will not change the order of your levels! Looking closely at the documentation will also show that as.factor() only takes a single argument - the object for conversion - without setting levels, order, or labels!

3.1.1 Basic formulas for calculating One-way ANOVA F-values


\begin{equation*} \begin{aligned} SSR_{i} &=\sum n_{i} (\bar{y}_{i} - \bar{y})^{2} &\text{Deviation of the estimated factor level mean around the overall mean, at factor level } i \text{ with } n_{i} \text{ samples}\\[10pt] SSE &=\sum_{i=1}^{\substack{k\ factor\\levels}} \sum_{j}^{\substack{m\\obs.}} ({y}_{ij} - \bar{y}_{i})^{2} &\text{Sum of deviation of each observation from its factor level mean}\\[30pt] DF_{Factor} &= \text{factor levels} - 1 &\text{Degrees of freedom for factor component}\\[10pt] DF_{Error} &= \text{total obs.} - \text{factor levels} &\text{Degrees of freedom for the MSE component}\\[30pt] MSR_{i} &= \frac{SSR_{i}}{DF_{Factor}} &\text{The mean square for a factor level } i\\[10pt] MSE &= \frac{SSE}{DF_{Error}} &\text{The mean square error across all samples}\\[30pt] F &= \frac{MSR_{i}}{MSE} &\text{Our final F-value for a given factor level } i \text{ analyzing: } \frac{\text{between-groups variance}}{\text{within-group variance}} \end{aligned} \end{equation*}


Still, in a nutshell...

To assess whether the means are equal, the model compares:

F-value.example.png

Recall our formulas when comparing between models with anova()? You can see it's similar in concept, except that the numerator and denominator are calculated in relation to the means of each factor level. The larger the MSR compared to the MSE, the more support there is for a difference between the population means.

So how do we put this to practical use in R?
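Before moving on, the formulas above can be checked numerically against anova() on toy data (variable names here are hypothetical):

```r
# Toy data: three groups with different means
set.seed(9)
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 30)))
d$y <- c(rnorm(30, 10), rnorm(30, 12), rnorm(30, 11))

grand <- mean(d$y)            # overall mean
gm    <- tapply(d$y, d$g, mean)  # factor-level means
n_i   <- table(d$g)              # samples per level

SSR <- sum(n_i * (gm[names(n_i)] - grand)^2)      # between-group sum of squares
SSE <- sum((d$y - gm[d$g])^2)                      # within-group sum of squares
F_hand <- (SSR / (3 - 1)) / (SSE / (nrow(d) - 3))  # MSR / MSE

F_r <- anova(lm(y ~ g, data = d))[["F value"]][1]
all.equal(as.numeric(F_hand), as.numeric(F_r))     # TRUE: same F-value
```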


3.1.2 Setting up a one-way ANOVA model for our question

Now that we know more about how a one-way ANOVA is calculated, let's talk about how to write the expression to model our categorical variable. The form of this equation should look more familiar.

Expression:

\begin{equation*}\Large Y \sim Normal({\beta_0 + \beta_1x_1 + \beta_2x_2}, {\sigma^2}) \end{equation*}

The interpretation of this model is a bit trickier.

Alternatively,

So you can think of each of these groups having their own means, i.e. $\mu_0 = \beta_0$, $\mu_1 = \beta_0 + \beta_1$, $\mu_2 = \beta_0 + \beta_2$. We are testing the hypothesis that not all of these means are equal.


3.1.3 Behind the curtain, your categorical factor is a dummy variable

Before jumping into the ANOVA, we should consider some ways to make our analysis simpler. For instance, we can encode our categorical variable as a dummy or treatment variable. In our data frame, for the rs174548 variable, 0 stands for the genotype C/C, 1 is C/G, and 2 is G/G. Therefore, there are $k$ factor levels to our categorical variable. Using this information, we can create a matrix with $k-1$ separate columns of 0s and 1s to represent our levels, and we will contrast each to the reference level (C/C). A genotype is compared to the reference genotype where it has a 1, so all comparisons to the reference are included in the matrix.

rs174548 x1 x2
C/C 0 0
C/G 1 0
G/G 0 1

Luckily for us, R will automatically create dummy variables in the background if you state you have a categorical variable.
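We can peek at the dummy coding R builds with model.matrix(). A sketch using the genotype labels from our table (the variable name `geno` is hypothetical):

```r
# Four example observations: C/C, C/C, C/G, G/G
geno <- factor(c(0, 0, 1, 2), labels = c("C/C", "C/G", "G/G"))

# The design matrix R constructs behind the scenes:
# an intercept column plus k-1 = 2 dummy columns against the reference C/C
mm <- model.matrix(~ geno)
mm
```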

Recall that ANOVA is also based on testing hypotheses. It will use the F-value and degrees of freedom from our model to ascertain which of the following hypotheses we should accept/reject.

  1. $H_0$ (null hypothesis): there is no difference between group means

  2. $H_a$ (alternate hypothesis): there is a difference between group means

Also, there are technically 2 ways for us to do this analysis!


3.1.4 One-way ANOVA using lm() and summary()

As before, we can build our linear model with lm() and then generate summary statistics for it. This will give us some helpful information for understanding and interpreting the model.


3.1.4.1 Interpretation of the model and statistics

Under our Coefficients table we see the individual p-values for observing the cholesterol values of a specific haplotype given that $H_0$ is true. Haplotype 0 is folded into the intercept.

Alternatively,

The summary tells us that the genetic factor rs174548 has an effect on cholesterol at a significance level of p = 0.01884 (< 0.05). Looking closely at the Adjusted R-squared value, we also see that this independent variable accounts for only 1.718% of variability in our cholesterol data.

An analysis of variance table from calling anova() with one model as input gives us the same F-value and p-value.

As you can see, we were able to retrieve the same information about the overall model showing that rs174548 genotype is a contributing factor to cholesterol values.


3.1.5 Use aov() and summary() to analyse our model

With the simple models we are generating, there is little advantage to using aov() over lm(). In fact, aov() calls on the lm() function behind the scenes. However, for more complicated models involving multiple factors and interactions, the summary output for aov() is better and it can also support error terms in your model.

So, leave this as an additional option to explore when generating more complex models.


3.1.6 Your ANOVA only tells if at least one population mean is significantly different!

Notice that the p-value for the whole set of comparisons is 0.01184 against our $H_0$ that the predictor, rs174548, has no relationship with the response variable of cholesterol. We must consider what this really means for the model we gave to lm().

All of this analysis tells us that there is a difference in means (rejects the null hypothesis that all means are the same), but it does not tell us which population/group means are significantly different from one another.

In order to look at this we need to look at multiple pairwise comparisons of

$$\mu_0 = \mu_1,\;\;\; \mu_0 = \mu_2,\;\;\; \mu_1 = \mu_2$$

3.2.0 Post-Hoc or Post-ANOVA tests: (multiple) test correction

Post-Hoc, Latin for "after this", refers to analyses done after an ANOVA, with the objective of identifying which means, if any, show significant differences.

Multiple comparisons increase the family-wise error rate (FWER) - the probability of making a false discovery (a.k.a. a false positive or Type-I error). This is where multiple test corrections come in to control the error at a specific threshold, i.e. $\alpha = 0.05$ or 5%. One of the simplest and most conservative is the Bonferroni correction ($\alpha/k$, or equivalently multiplying the p-values by $k$).

Remember: $k$ is the number of tests performed - here the 3 pairwise comparisons between the levels of rs174548.
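A quick sketch of the Bonferroni correction with base R's p.adjust(), using hypothetical unadjusted p-values:

```r
# Three hypothetical unadjusted p-values, one per pairwise comparison (k = 3)
p_raw <- c(0.012, 0.080, 0.700)

p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj                                 # 0.036 0.240 1.000 (capped at 1)

all.equal(p_adj, pmin(p_raw * 3, 1))  # TRUE: multiply by k, cap at 1
```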

To run multiple tests to see if group means differ, we can use general linear hypothesis testing, which takes two objects:

  1. a model
  2. a contrast matrix

We'll start by remaking our model with a slight change to obtain the absolute mean of each group rather than the relative mean. Previously, it was mentioned that you can fit the means for each group using lm(y ~ f-1).


3.2.1 Build a contrast matrix with contrMat()

For the second object needed in our group comparisons, we want to make a contrast matrix. The simplest contrast matrix is a matrix of 0s, 1s, and -1s, where -1 and 1 mark the two factor levels whose difference you want to test. Notice that the total of each row is 0.


3.2.2 General linear hypothesis testing with glht()

The rownames of the contrast matrix tell us what is being compared ([1-0], [2-0], [2-1]). For example [1-0] is the difference between C/C (-1) and C/G (1).

More complicated comparisons can be made. For example, the difference between C/C and the average of G/G and C/G could be specified by adding a row to the matrix of -2 1 1 (Note: rows of a contrast matrix must add to zero).

To get estimates using general linear hypothesis testing, we use the aptly named glht() function with our linear hypotheses to be compared as specified by our contrast matrix. The function will take in our ANOVA model and produce the summary information used to compare each combination of model variables put forth in our contrast matrix. Overall this takes the work out of creating our own series of comparisons.

We will first look at a summary without adjusting/correcting our p-values.
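A hedged sketch of both objects on simulated stand-in data, assuming the multcomp package (which provides glht()) is available; here the contrast matrix is built by hand rather than with contrMat():

```r
# Each contrast row picks the difference between two group means;
# the columns match the three coefficients of the means model lm(y ~ f - 1)
K <- rbind("1 - 0" = c(-1, 1, 0),
           "2 - 0" = c(-1, 0, 1),
           "2 - 1" = c(0, -1, 1))
rowSums(K)  # all rows sum to zero, as required

if (requireNamespace("multcomp", quietly = TRUE)) {
  # Simulated stand-in for the genotype/cholesterol data
  set.seed(2)
  d <- data.frame(f = factor(rep(0:2, each = 40)))
  d$y <- 180 + 7 * (d$f == 1) + rnorm(120, sd = 25)

  meansFit <- lm(y ~ f - 1, data = d)  # absolute group means
  fit_glht <- multcomp::glht(meansFit, linfct = K)

  summary(fit_glht, test = multcomp::adjusted("none"))        # unadjusted
  summary(fit_glht, test = multcomp::adjusted("bonferroni"))  # corrected
}
```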


3.2.3 Interpretation

[1 - 0 == 0] The difference in means between C/C and C/G is 6.80 mg/dl and this difference is significant.
[2 - 0 == 0] The difference in means between C/C and G/G is 5.44 mg/dl and this difference is not significant.
[2 - 1 == 0] The difference in means between C/G and G/G is -1.36 mg/dl and this difference is not significant.

You may notice that some of these p-values look familiar from our original summary() of the model. We have, however, added the comparison of the C/G to G/G genotypes.


3.2.4 Correction for multiple testing

Don't forget we've tested 3 different pairings and need to account for the possibility of false positives.

But Why?

Remember: if we have set an $\alpha$ level of 0.05, then we may reject a true $H_0$ by chance once in every 20 tests.

Two general procedures to control error rates: While we won't go over all of the ways to control for error, recall that there are two main types of errors we are usually concerned with: 1. Type-I error is the false rejection of the null hypothesis, a.k.a. false positives; 2. Type-II error is the false acceptance of the null hypothesis, a.k.a. false negatives. Depending on the nature of your data and how many tests you've completed, you will want to perform some kind of correction on your results. Two popular procedures are the Bonferroni correction and Benjamini-Hochberg. The former strictly penalizes all p-values equally, thus controlling the Type-I error but perhaps introducing some more Type-II error - overall erring on the side of caution. The latter penalizes your p-values based on their ranking, attempting to control the proportion of wrongly rejected null hypotheses (the False Discovery Rate) within the group of all rejected null hypotheses, rather than across the whole set of p-values supplied.

Let's check if multiple test correction affects these relationships.

The significant difference in mean cholesterol between C/C and C/G genotypes of rs174548 holds under different multiple test corrections.


3.3.0 Use multi-way analysis of variance (ANOVA) for two or more categorical variables

What if we use two or more categorical variables (factors) to model our outcome? We can now ask the question:

Does the effect of the genetic factor rs174548 differ between males and females?

We need to test whether there is an effect of our factors on cholesterol and also if there is an interaction between these factors.

Expression:

\begin{equation*}\Large Y \sim Normal({\alpha_i} + {\beta_j}, {\sigma^2}) \end{equation*}

$\alpha$ and $\beta$ are our categorical variables where $i$ is the level of the first group, and $j$ is the level of the second group. As with one-way ANOVA, R models our categorical variables as factors.

R code:

lm(y ~ factor1 + factor2): testing for main effects without interaction.
lm(y ~ factor1 * factor2): testing for the main effects with interaction.

The following diagram will help us visualize the differences in coefficients with and without interaction between 2 categorical variables. In this first scenario, the difference in the means between groups defined by factor B does not depend on the level of factor A and vice versa. This means that there is no interaction, and the lines between the factor groups are parallel. In the second scenario the difference in the means between groups defined by factor B changes when A2 is present. There is an interaction and the lines are not parallel.



2wayanova.png


3.3.1 Two-way ANOVA without interaction using lm() or aov()

Let's begin looking at our dependent variable cholesterol by assuming there is no interaction between sex and the allele rs174548. We have two options to consider at this point: do we use lm() or aov()? Let's compare the summary output from each.


3.3.1.1 lm() t-test summaries versus aov() anova tables

Looking closely, we see there are two different outputs. The summary() of each object produces a slightly different format and result. The lm() summary provides a series of t-test results for the levels of each factor, showing how each level influences the value of cholesterol with respect to the baseline mean. Each level is accompanied by a corresponding p-value, as we've seen, and the model as a whole has an adjusted R-squared value.

The aov() summary, on the other hand, has less but also different information. Instead, it notes the p-value of each factor as a whole, giving us an idea of whether the term itself is significant to our model. If you were working with many independent variables, you might be more concerned with discerning which ones actually influence the overall variance of your dataset. Note also the p-values of the sex variable in our two models. Why are they different?

Remember that both of these functions produce an object that contains all of the information for our model. What we need, however, is a way to summarize that information correctly.

Multiple interaction terms: If we had time to dig deep into these two functions, we would find that, output aside, models with multiple factors and interaction terms are much easier to summarize and interpret with aov(); lm() can fit them, but its coefficient-level output quickly becomes unwieldy, so such models are best left to aov().

3.3.2 Use Anova() to interpret your results as an ANOVA table

Regardless of which function you use to generate your models, you can produce a summary ANOVA table of each model using the car::Anova() function. Yes, R does include an anova() function in the base package too! Just be careful which one you decide to use, as the former can use Type-III sums of squares to incorporate error terms while the latter uses Type-I.

Type-I or Type-III sum of squares? What's the big difference between these two approaches and why should you care? For simple models, the difference may not be apparent. It all depends on how the algorithm apportions the error from each independent variable (factor) in your ANOVA.

Type-I is sensitive to the order of factors in your equation, since it attributes overlapping SS error to specific factors sequentially. This is implemented in both aov() and anova().

Type-III, on the other hand, is more conservative, leaving potentially overlapping SS error out of the analysis, and it will not be affected by how you order your factors in your equation. This is available through car::Anova(). For some extra reading on the topic, you can check out these helpful sites.

As you can see, there are slightly different results in terms of calculating the sum of squares error between the two ANOVA methods. This also results in different p-values for the independent variable of sex. Again, if we were working with many more potential independent variables, we might see more of an impact between the two SS types as well.

3.3.3 Interpreting a two-way ANOVA without interaction

Going back to the results of our lm() call:

- The estimated mean cholesterol for males in the C/C group is the intercept, 175.36 mg/dl.
- The estimated difference in mean cholesterol between females and males, controlled for genotype, is 11.05 mg/dl.
- The estimated difference in mean between the C/G and C/C groups, controlled for sex, is 7.24 mg/dl.
- The estimated difference in mean between the G/G and C/C groups, controlled for sex, is 5.18 mg/dl.

There is evidence cholesterol level is associated with our rs174548 locus and sex (p < 0.001). This combined model also appears to explain more variability in the data than with just rs174548 (7.7% vs. 1.7%).

How does this compare to the model with sex alone as a predictor? Recall that in the case of comparing models, we should use the anova() function.


In conclusion, our combined non-interacting model explains more variability in our response variable than looking at sex or rs174548 alone; and while sex alone explains a good deal of variability, the combined model of both variables is still significantly different.

second_anova.jpg


3.3.4 Build our two-way ANOVA model with interaction

While we have concluded there is a difference between our two main ANOVA models, we have to address the possibility that there may be an interaction between these two terms in our model. Therefore we should now check the two-way ANOVA model with the interaction.


3.3.5 Interpreting our aov model

According to our Anova() table of the model, it does appear that there is a statistically significant interaction between sex and rs174548. If we are interested in the specific interaction, we can take a closer look with lm() in this case since we are only using a single interaction term in our model.


3.3.5.1 Interpreting a two-way ANOVA with interaction

- The estimated mean cholesterol for males in the C/C group is 178.12 mg/dl.
- The estimated mean cholesterol for females in the C/C group is (178.12 + 5.71) mg/dl.
- The estimated mean cholesterol for males in the C/G group is (178.12 + 0.96) mg/dl.
- The estimated mean cholesterol for females in the C/G group is (178.12 + 5.71 + 0.96 + 12.74) mg/dl.

There appears to be a significant interaction between being female and having the C/G genotype (p<0.01).

Let's compare the 2 two-way ANOVA models we've generated.

There is evidence that these two models are different (p = 0.01483). Let's compare all three models as a last step.


3.3.6 Review the models and investigate further with TukeyHSD()

We've generated 3 models to explain our cholesterol values in terms of sex and the genetic locus rs174548.

Independent variables   Model           Term relationship   Adjusted R-squared   F-value   p-value     AIC
sex                     One-way ANOVA   NA                  0.05763              25.4      7.087e-07   3592.509
sex, rs174548           Two-way ANOVA   additive            0.07764              12.2      1.196e-07   3585.907
sex, rs174548           Two-way ANOVA   interacting         0.09257              9.14      3.062e-08   3581.357

It looks like our third model that accounts for interaction between sex and rs174548 explains the most variation. To investigate the story further we would want to look at which interactions are possibly the most significant.

Is your ANOVA design balanced? While we haven't directly addressed this in our data, keep in mind that a balanced design has the same number of observations for every possible combination of levels. This is easier to accomplish when you are looking at a single factor and its levels; when you start using multiple factors and levels, the number of combinations quickly increases!

Since we already have an ANOVA model and wish to make multiple comparisons, we can use the TukeyHSD() function (Tukey's Honestly Significant Differences) to identify which levels may interact across the factors in our model. This function works best on a balanced design but can adjust for mildly unbalanced designs as well.

The procedure creates a critical value based on your design and desired $\alpha$ level, so the result already accounts for the multiple comparisons generated, providing adjusted p-values. To recap, all we need is the fitted aov() model.
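A sketch of the call on simulated stand-in data (the column names sex and geno are hypothetical; the course's model would use the real variables):

```r
# Balanced simulated design: 30 observations per sex-by-genotype combination
set.seed(11)
d <- data.frame(sex  = factor(rep(c("M", "F"), each = 90)),
                geno = factor(rep(c("C/C", "C/G", "G/G"), times = 60)))
d$chol <- 180 + 10 * (d$sex == "F") +
          12 * (d$sex == "F" & d$geno == "C/G") + rnorm(180, sd = 15)

fit <- aov(chol ~ sex * geno, data = d)
tk  <- TukeyHSD(fit)  # adjusted p-values for every pairwise comparison

names(tk)             # one table per term: "sex", "geno", "sex:geno"
tk[["sex:geno"]]      # the grouped a:b comparisons, as in the table above
```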


3.3.6.1 Interpreting our TukeyHSD() output

From above we can see our different groups compared to each other, with p-values adjusted for multiple comparisons. Looking directly at the interaction portion of the analysis, we can ascertain that there are 3 highly significant interacting groups, and one additional group with significant results.

Group 1 Group 2 Difference in means Adjusted p-value
Females: C/G Males: C/C 19.41 mg/dl 0.0000001
Females: C/G Females: C/C 13.69 mg/dl 0.0003054
Females: C/G Males: C/G 18.45 mg/dl 0.0000028
Females: C/G Males: G/G 19.61 mg/dl 0.0360298

It would appear that females with the C/G genotype may have a significantly increased cholesterol value across all of our samples. This would, however, require deeper analysis to confirm.

Note that if we had used TukeyHSD() on an additive model of our categorical variables, we would get a similar looking output measuring the differences between different factor levels but it would not include the a:b grouped comparisons.


3.4.0 Analysis of covariance (ANCOVA)

The analysis of covariance (ANCOVA) model allows us to compare the means of a dependent variable between two or more groups while taking into account variability of other variables (covariates). ANCOVA, therefore, uses a continuous variable as a linear regression component (covariate). In simpler terms, the ANCOVA builds an ANOVA model to explain the between-group variance but incorporates the covariates to help explain the within-group variance.

This allows us to ask the question:

Is the relationship between age and cholesterol affected by sex?

Our model won't look that different from our other equations except that we have categorical and continuous predictor variables.

Expression:

\begin{equation*}\Large Y \verb|~| Normal({\alpha_i} + {\beta_i}x, {\sigma^2}) \end{equation*}


Parameters are the intercept of the first factor level, the slope with respect to $x$ for the first factor level, the differences in the intercepts for each factor level other than the first, and the differences in slopes for each factor level other than the first.

What exactly is a covariate? In an ANOVA setting, covariates are continuous, measured (or observed) characteristics of the observations, excluding the treatment variable itself, that co-vary with the dependent variable. In the context of our dataset, a covariate may affect the outcome of the dependent variable, and covariates can be treated either as independent variables of interest or as unwanted confounding variables.

R code:

lm(y ~ f + x), testing for main effects without interaction.
lm(y ~ f*x), testing for main effects with interaction.

To answer our question, let's first take a quick look at sex differences in cholesterol in our dataset, keeping in mind that males are encoded as 0 and females as 1. Based on sex information alone, we see that women have a higher mean serum cholesterol, but we don't know if this is significant.


3.4.1 Conditions for performing the ANCOVA

Before moving on, there are some conditions we should look into before beginning our ANCOVA:

  1. Linearity between the covariate and the outcome variable at each level of the grouping variable. In this case we are referring to cholesterol (outcome) versus age (covariate) when grouped by sex.

  2. Homogeneity of regression slopes. This assumes no interaction between the independent variable and covariate. In other words, the slope ($\beta$) coefficients must be equal amongst all sub-groups.

  3. The residuals of the outcome variable should be approximately normally distributed.

  4. Homoscedasticity. Remember we want homogeneous residual variance for all groups.

  5. No significant outliers in the group.

3.4.1.1 Check your linearity condition

We've been looking at this data quite a bit, but let's remind ourselves that the relationship between age and cholesterol remains linear when grouping by sex. We can generate a scatterplot with regression lines to help us visually assess the situation.
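A sketch of that linearity check, using ggplot2 with simulated stand-in data (the real plot uses the course's cholesterol dataset, so the values here are made up):

```r
library(ggplot2)

# Simulated stand-in: cholesterol rising with age, with a sex offset.
set.seed(2)
d <- data.frame(
  age = runif(200, 30, 70),
  sex = factor(rep(c("male", "female"), each = 100))
)
d$chol <- 160 + 0.3 * d$age + 10 * (d$sex == "female") + rnorm(200, sd = 15)

# One regression line per sex; near-parallel lines suggest no interaction.
p <- ggplot(d, aes(x = age, y = chol, colour = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```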


From our visual inspection of the scatterplot, we still see a trend of increasing cholesterol with increasing age even when grouping by sex. Sex doesn't change the relationship between age and cholesterol, as the lines are almost parallel (i.e., no interaction). We can also see that females, on average, appear to have a higher mean cholesterol than males at any given age.

3.4.1.2 Check for homogeneity of the regression slopes with an ANOVA

For simplicity, we can build a model based on an interaction between sex and age using aov() and look at the results.
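A sketch of that homogeneity-of-slopes check (again with simulated stand-in data):

```r
# Simulated stand-in data with no built-in sex:age interaction.
set.seed(3)
d <- data.frame(age = runif(200, 30, 70),
                sex = factor(sample(c("male", "female"), 200, replace = TRUE)))
d$chol <- 160 + 0.3 * d$age + 10 * (d$sex == "male") + rnorm(200, sd = 15)

# Fit sex * age and inspect the sex:age row of the ANOVA table;
# a non-significant F there supports equal slopes across groups.
slope_check <- aov(chol ~ sex * age, data = d)
summary(slope_check)
```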


From above, we see that both sex and age appear to have a significant effect on cholesterol values, as we have already seen in our previous models. We can also see from our interaction test "sex:age" that the F-value is low and the associated p-value is high, so no significant interaction occurs between the two variables.

From here, we'll assume the remaining requirements 3-5 have been met. Let's take a look at the model itself.
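A sketch of that additive ANCOVA, with simulated stand-ins for the course data (males coded 0, females coded 1, as in the lesson):

```r
# Simulated stand-in data for the additive (no-interaction) ANCOVA.
set.seed(4)
d <- data.frame(age = runif(200, 30, 70),
                sex = rbinom(200, 1, 0.5))  # 0 = male, 1 = female
d$chol <- 160 + 0.3 * d$age + 10 * d$sex + rnorm(200, sd = 15)

# One common age slope, plus a sex shift in the intercept.
ageSexFit <- lm(chol ~ as.factor(sex) + age, data = d)
summary(ageSexFit)
```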


3.4.2 Interpreting our ANCOVA without interactions

Controlling for sex, mean cholesterol increases by 0.29697 mg/dl for an additional year of age. This is close to the slope for our model of cholesterol alone, 0.31 mg/dl. This does not necessarily mean that the age/cholesterol relationship is the same in males and females; we need to check out the interaction term. There appears to be an increase in mean serum cholesterol of 10.5 mg/dl in females over males.

Let's build an lm model that includes interaction between age and sex.
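A sketch of that interacting model, again with simulated stand-in data:

```r
# Simulated stand-in data: a small built-in sex:age interaction.
set.seed(6)
d <- data.frame(age = runif(200, 30, 70), sex = rbinom(200, 1, 0.5))
d$chol <- 160 + 0.33 * d$age + 14 * d$sex - 0.05 * d$sex * d$age + rnorm(200, sd = 15)

# The as.factor(sex):age term lets the age slope differ by sex:
# four coefficients in total (male intercept/slope plus female differences).
ageSexIntFit <- lm(chol ~ as.factor(sex) * age, data = d)
summary(ageSexIntFit)
```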


3.4.3 Interpreting our ANCOVA with interactions

Males are coded as 0 and females are coded as 1 in this model.

The intercept term is the mean serum cholesterol for MALES at age 0 (160.31 mg/dl). The slope term for age is the difference in mean cholesterol associated with a one year change in age for MALES. The slope for sex (14.56 mg/dl) is the difference in mean cholesterol between males and females at age 0. The interaction term is the difference in the change in mean cholesterol associated with each one year change in age for females compared to males. Sex exerts a small and not statistically significant effect on the age/cholesterol relationship.


3.4.4 Compare the results of our many models

Let's compare the results of our models with and without an interaction term.

ANCOVA Model       Male Intercept   Male Slope   Female Intercept   Female Slope
No Interaction     162.35           0.29         172.85             0.29
With Interaction   160.31           0.33         174.87             0.26

They seem quite similar, and we already know from testing our conditions that there is no significant interaction between age and sex. Still, let's confirm by comparing the two models directly.
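A sketch of that direct comparison with anova(), using simulated stand-in data (a non-significant F says the interaction adds little):

```r
# Simulated stand-in data (no built-in interaction).
set.seed(5)
d <- data.frame(age = runif(200, 30, 70), sex = rbinom(200, 1, 0.5))
d$chol <- 160 + 0.3 * d$age + 10 * d$sex + rnorm(200, sd = 15)

additive    <- lm(chol ~ as.factor(sex) + age, data = d)
interacting <- lm(chol ~ as.factor(sex) * age, data = d)

# F-test of the nested models: does the interaction term help?
cmp <- anova(additive, interacting)
cmp
```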


3.4.5 Does our ANCOVA model improve on our one-way ANOVA?

In conclusion, adding the interaction term did not change the model significantly. Does our ANCOVA model ageSexFit grouping cholesterol values by sex and using age as a covariate improve on our one-way ANOVA model sexfit where we predict cholesterol using only sex?
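A sketch of that comparison, using simulated stand-ins for sexfit and ageSexFit:

```r
# Simulated stand-in data where age genuinely matters.
set.seed(15)
d <- data.frame(age = runif(200, 30, 70), sex = rbinom(200, 1, 0.5))
d$chol <- 160 + 0.3 * d$age + 10 * d$sex + rnorm(200, sd = 15)

sexFit    <- lm(chol ~ as.factor(sex), data = d)        # one-way ANOVA
ageSexFit <- lm(chol ~ as.factor(sex) + age, data = d)  # ANCOVA

anova(sexFit, ageSexFit)       # F-test of the nested models
aics <- AIC(sexFit, ageSexFit) # lower AIC indicates the preferred model
aics
```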

Indeed the addition of our age covariate does result in a very different model from predicting by sex alone. Furthermore, the addition of age improves on the model based on a number of indicators including the AIC comparison of the models.


4.0.0 Review: Models we used today

bar_too_low_glms2.jpg

Before we move on, I want to take a step back and quickly review the models and code we've gone through today. Firstly, with our example dataset, and then more generally. I hope you can see that although conceptually different, getting a handle on the code isn't too bad.

For all of these models we are trying to determine the effect of different variables on cholesterol. The differences are whether we are using:

model                                     categorical              continuous               R code
simple linear regression                  X                        $\checkmark$             lm(chol ~ age)
multiple linear regression                X                        $\checkmark\checkmark$   lm(chol ~ age + BMI)
                                                                                            lm(chol ~ age*BMI)
one-way analysis of variance (ANOVA)      $\checkmark$             X                        lm(chol ~ as.factor(rs174548))
multi-way analysis of variance (ANOVA)    $\checkmark\checkmark$   X                        lm(chol ~ as.factor(sex) + as.factor(rs174548))
                                                                                            lm(chol ~ as.factor(sex)*as.factor(rs174548))
                                                                                            aov(chol ~ as.factor(sex) + as.factor(rs174548))
                                                                                            aov(chol ~ as.factor(sex)*as.factor(rs174548))
analysis of covariance (ANCOVA)           $\checkmark$             $\checkmark$             lm(chol ~ as.factor(sex) + age)
                                                                                            lm(chol ~ as.factor(sex)*age)

Remember that you can produce ANOVA tables using Type-III sums of squares with the car::Anova() function. We have started with models that assume normally distributed errors, but there are models left unexplored in this lecture.


4.0.1 Quick code tables for regression models

In the table below, our R code for each of the models has been generalized. Here, $y$ is our response (dependent) variable, $x$ is a continuous predictor, and $f$ is a categorical predictor (assumed to be a factor already).

model                                     R code
simple linear regression                  lm(y ~ x)
multiple linear regression                lm(y ~ x + I(x^2))
                                          lm(y ~ x1 + x2)
                                          lm(y ~ x1*x2)
one-way analysis of variance (ANOVA)      lm(y ~ f)
multi-way analysis of variance (ANOVA)    lm(y ~ f1 + f2)
                                          lm(y ~ f1*f2)
analysis of covariance (ANCOVA)           lm(y ~ f + x)
                                          lm(y ~ f*x)

You need not memorize any of these charts - you may just want to use them to orient yourself in the future. Much of the R code seems the same whether you are doing multiple linear regression, ANOVA or ANCOVA, so it is good to have a reference point.


4.0.2 Linear models as a "make the appropriate choice" adventure

A non-exhaustive but fair summary of what we've discussed, which generally applies to the small corner of linear regression we've covered today.

Linear_Model_Flowchart.png


4.1.0 Examine the effects of rs174548 and age from our data

Does the effect of the genetic factor rs174548 differ depending on a subject's age?

This sounds like we want to use ANCOVA since we have a categorical (rs174548) and a continuous (age) variable. Let's break the process into steps again:

  1. Graph the dependent outcome against the covariate and compare based on subgroups.
  2. Build additive and interacting models.
  3. Check the F-statistics for interaction.
  4. Compare to simple regression models.

4.1.1 Visualize our data and look for interaction

Make a plot of age versus cholesterol and color points by genotype. Add a linear model to the plot. Are you expecting an interaction based on this plot?


There appear to be some non-parallel slopes in our visualization, which suggests there could be an interaction between our covariate and independent variable.

4.1.2 Build your potential models - both additive and interacting

Time to build our models for comparison against each other. We will use aov() to produce both an additive and interacting model involving rs174548 and age. We'll begin with the interacting model to see if it follows with our observations thus far.
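A sketch of both models with aov(), using a simulated three-level factor standing in for rs174548 (coded 0/1/2 as in the course data):

```r
# Simulated stand-in: genotype effect plus an age trend, no interaction.
set.seed(7)
d <- data.frame(
  age  = runif(300, 30, 70),
  geno = factor(sample(0:2, 300, replace = TRUE, prob = c(0.5, 0.4, 0.1)))
)
d$chol <- 160 + 0.3 * d$age + c(0, 7, 3)[as.integer(d$geno)] + rnorm(300, sd = 15)

interacting <- aov(chol ~ geno * age, data = d)
additive    <- aov(chol ~ geno + age, data = d)
summary(interacting)  # check the geno:age row for evidence of interaction
```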


Although the slopes aren't exactly parallel between all the rs174548 genotypes, there is no significant interaction detected. Take a look at the ANCOVA as an additive model.


Look at the summary statistics for each model fit. How would you interpret the results? Notice that level 2 of rs174548 (ie the blue line in our visualization) has no significant p-value versus the baseline level of 0.

4.1.3 Compare your ANCOVA models against each other and simpler models

Let us compare the two models with an analysis of variance table to make sure that interaction doesn't really add any information to our model. Then we can compare against some simpler models using just age and rs174548.
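A sketch of those comparisons, the nested F-test first and then AIC across the candidate models (simulated data; geno stands in for rs174548):

```r
# Simulated stand-in data.
set.seed(16)
d <- data.frame(age  = runif(300, 30, 70),
                geno = factor(sample(0:2, 300, replace = TRUE)))
d$chol <- 160 + 0.3 * d$age + c(0, 7, 3)[as.integer(d$geno)] + rnorm(300, sd = 15)

additive    <- lm(chol ~ geno + age, data = d)
interacting <- lm(chol ~ geno * age, data = d)

anova(additive, interacting)  # does the interaction add information?
aics <- AIC(lm(chol ~ age, data = d), lm(chol ~ geno, data = d), additive)
aics
```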


Using AIC, we see that modeling cholesterol variation with the additive model of age and rs174548 gives a better fit than using age or rs174548 alone.

4.2.0 Examine the potential effects of APOE genotype and age on cholesterol

Let's dig deeper into our data and see what effect the genetic locus APOE has on cholesterol values. Some questions we can ask and tasks we can perform:

  1. Does the genetic factor APOE have an effect on cholesterol levels?
  2. If so, which variants appear to have a significant effect?
  3. Does this interaction vary depending on a subject's age?
  4. Plot the relationship between APOE and cholesterol.
  5. Generate models to compare the effects of these variables on cholesterol values.
  6. Which model does the 'best' job?

Looking at the visualization, we should note a couple of things:

  1. There does appear to be some difference between the different APOE genotype groups in terms of cholesterol values
  2. The genotypes are not evenly distributed across the groups. How might this affect our later analyses?

4.2.1 Build our model using APOE alone

Recall there are two ways we can build our ANOVA to look at the effect of each genotype. We can either compare it to a reference genotype (level 1, e2/e2) or we can produce the intercept for each group level by including a -1 in our model formula.
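A sketch of the two parameterizations, with a simulated factor standing in for APOE: the default treats the first level as the reference, while adding - 1 to the formula reports one intercept (group mean) per level.

```r
# Simulated stand-in for the APOE factor with six genotype groups.
set.seed(8)
d <- data.frame(apoe = factor(sample(paste0("g", 1:6), 300, replace = TRUE)))
d$chol <- 170 + 5 * as.integer(d$apoe) + rnorm(300, sd = 15)

coef(lm(chol ~ apoe, data = d))      # intercept = reference group's mean
ref_free <- lm(chol ~ apoe - 1, data = d)
coef(ref_free)                       # one intercept (mean) per group
```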


4.2.2 Compare all of the individual groups to each other

From our first glimpse we see that all of the levels are different from the base reference genotype of e2/e2. Do you recall how we can compare all of the groups against each other?

Recall we can build a contrast matrix - a simple way of describing which groups should be directly compared to each other. We will again use the contrMat() function in the format of a Tukey HSD. From there we can just pass this on to the glht() function for comparison.
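A sketch of the contrast-matrix route, assuming the multcomp package is installed (the genotype labels below are simulated stand-ins for the APOE levels):

```r
library(multcomp)

# Simulated stand-in for the APOE factor and cholesterol values.
set.seed(17)
d <- data.frame(apoe = factor(sample(paste0("g", 1:4), 200, replace = TRUE)))
d$chol <- 170 + 4 * as.integer(d$apoe) + rnorm(200, sd = 15)

fit <- lm(chol ~ apoe, data = d)
K   <- contrMat(table(d$apoe), type = "Tukey")  # all pairwise contrasts
summary(glht(fit, linfct = mcp(apoe = K)))      # adjusted multiple comparisons
```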


4.2.2.1 Consider TukeyHSD() on your unbalanced design

While we previously suggested that Tukey HSD works best on a balanced design, there is actually a secondary implementation known as the Tukey-Kramer test. The TukeyHSD() function implements this under the hood, so we could skip the contrast matrix step altogether, but we then don't get to dictate how our multiple comparisons are corrected.
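A sketch of that shortcut on a deliberately unbalanced, simulated factor:

```r
# Simulated stand-in with unequal group sizes; TukeyHSD() on an aov fit
# applies the Tukey-Kramer adjustment for unbalanced groups automatically.
set.seed(18)
apoe <- factor(sample(paste0("g", 1:4), 200, replace = TRUE,
                      prob = c(0.1, 0.4, 0.4, 0.1)))
chol <- 170 + 5 * as.integer(apoe) + rnorm(200, sd = 15)

tk <- TukeyHSD(aov(chol ~ apoe))
tk
```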


4.2.3 Conclusions about the APOE effect

The APOE genotype has a significant effect on cholesterol with the most significant differences being between genotype 1 of the homozygous $\Large\varepsilon$2 alleles and genotype combinations of the $\Large\varepsilon$3 and $\Large\varepsilon$4 alleles. APOE genotype 5 is also significantly different from genotypes 2 and 4. However, the significant difference between genotype 5 and 2 is lost after multiple test correction.

4.2.4 Digging deeper into our model

Does this effect change with age or does it hold when age is held constant? Let's investigate further.

We'll want to

  1. Plot our data to check for linearity and homogeneity of regression.
  2. Generate our ANCOVA models of APOE and age.
  3. Compare our models against each other and our previous ANOVA on just APOE.

4.2.5 Build our ANCOVA interacting model

It looks as though we are in a similar situation as our last analysis: there may be some interaction between age and APOE based on the regression lines we've added. We'll investigate further by building the interacting model first with aov().


4.2.6 Build an ANCOVA additive model using lm()

From the analysis, there appears to be no significant interaction between APOE and the covariate of age so we can proceed to build an ANCOVA without interaction. We'll build our additive model using the lm() function so we can quickly see how each subgroup compares to the reference genotype.


4.2.7 Compare models across different analyses

Now it's time to return to our various models for comparison again. How different are the additive versus interacting ANCOVA models? How does that compare to looking at the ANOVA model with APOE alone or age alone? We can finish up with a call to AIC() for confirmation.


4.2.8 Conclusion about the effect of APOE and age on cholesterol

The APOE genotype has a significant effect on cholesterol. This relationship remains while holding age constant. There is no significant interaction between APOE genotype and age. Our additive model performs better than one that includes an interaction between our independent variable and the covariate. Therefore the 'best' model incorporates both APOE genotype and age as an additive model.


5.0.0 Class summary

That's the end for our sixth and penultimate class on R! You've made it through linear models and we've learned about the following:

  1. Exploring the distribution of your data.
  2. Simple and additive linear regression models.
  3. ANOVA models for categorical variables.
  4. ANCOVA models for analysis of covariates.

5.1.0 Post-lecture assessment (12% of final grade)

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete all chapters from the Correlation and Regression in R course (4,200 points total). This is a pass-fail assignment, and in order to pass you need to achieve at least 3,150 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:

DataCampIntroR.jpg

You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section, and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, October 21st to submit your assignment (right before the next lecture).


5.2.0 Acknowledgements

Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 Your DataCamp academic subscription

This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.

Your DataCamp academic subscription grants you free access to the DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.

DataCampLogo.png


5.4.0 Resources

https://github.com/ttimbers/lm_and_glm/blob/master/lm_and_glm.Rmd
https://github.com/seananderson/glmm-course
http://michael.hahsler.net/SMU/EMIS7331/R/regression.html
https://ms.mcmaster.ca/~bolker/emdbook/book.pdf
http://www.differencebetween.net/science/mathematics-statistics/difference-between-ancova-and-regression/
https://stats.stackexchange.com/questions/77563/linear-model-fit-seems-off-in-r
https://www.biostat.washington.edu/suminst/archives/SISG2016/SM1604
https://ms.mcmaster.ca/~bolker/
http://www.mathnstuff.com/math/spoken/here/2class/90/htest.htm


6.0.0 Appendix

6.1.0 Prediction

When predicting values you are assuming that your model is true. This might be fair within the range of your data, but predictions outside that range should be interpreted with caution. For example, polynomial data may look linear over a certain range.

extrapolating.png

The predict() function works with many different kinds of fits: not just linear models but nonlinear, polynomial, generalized linear models, etc. predict() dispatches on the class of the fitted object, but you can also call the method directly as predict.lm(). The help page for predict.lm is more useful here, as it is specific to the linear model fit.


6.1.1 Challenge

Use predict.lm() to predict the mean cholesterol at age 47 from our model object ageFit. Recall that ageFit is our first model:

lm(chol ~ age, data = cholesterol).

In addition to the linear model, the function needs the newdata that we want to predict. Note that newdata takes in a data frame. We can predict the mean cholesterol at age 47 within a confidence interval that can be specified using level. The output is the mean, as well as the upper and lower boundaries of the estimate.

We can also use a 'prediction' interval.
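A sketch of both calls, using simulated stand-in data in place of ageFit (note that newdata must be a data frame):

```r
# Simulated stand-in for the cholesterol ~ age fit.
set.seed(9)
d <- data.frame(age = runif(200, 30, 70))
d$chol <- 160 + 0.3 * d$age + rnorm(200, sd = 15)
ageFit <- lm(chol ~ age, data = d)

# Confidence interval for the MEAN cholesterol at age 47...
ci <- predict(ageFit, newdata = data.frame(age = 47),
              interval = "confidence", level = 0.95)
# ...vs a prediction interval for a single NEW individual at age 47.
pi <- predict(ageFit, newdata = data.frame(age = 47),
              interval = "prediction", level = 0.95)
ci
pi  # wider: it adds individual-level variability around the mean
```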


Notice the difference in the upper and lower boundaries for these predictions. The first is the prediction for the mean serum cholesterol for individuals of age 47 and the second is for a single new individual of age 47. The second prediction has to account for random variability around the mean, rather than just the precision of the estimate of the mean. This may seem like a subtle difference, but as you can see it can change our boundaries quite a bit - we need to be clear on the question we are asking.

For our multiple linear regression model explaining cholesterol as a function of age and BMI (ageBMIFit), we could ask what cholesterol is predicted to be for a 60-year-old at a BMI of 21, a 60-year-old at a BMI of 26, and a 60-year-old at a BMI of 30. The standard error on the estimate of your means is obtained by setting se.fit = TRUE.
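A sketch of predicting at several covariate combinations at once, with a simulated stand-in for ageBMIFit:

```r
# Simulated stand-in for the cholesterol ~ age + BMI fit.
set.seed(14)
d <- data.frame(age = runif(200, 30, 70), BMI = runif(200, 19, 35))
d$chol <- 120 + 0.3 * d$age + 1.2 * d$BMI + rnorm(200, sd = 15)
ageBMIFit <- lm(chol ~ age + BMI, data = d)

# Three hypothetical 60-year-olds at BMIs of 21, 26, and 30;
# se.fit = TRUE returns the standard error of each estimated mean.
new_obs <- data.frame(age = 60, BMI = c(21, 26, 30))
pr <- predict(ageBMIFit, newdata = new_obs, se.fit = TRUE)
pr
```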


6.2.0 Assessing the performance of the model (feedback)

6.2.1 Checking residuals by plotting

Residuals are the differences between the observed response and the predicted response, and can be used to identify poorly fit data points, unequal variance (heteroscedasticity), nonlinear relationships, and examine the normality assumption.

We can plot the residuals vs $x$, residuals vs $y$, or a histogram of the residuals to see if there are any patterns. For example, plotting residuals against $x$ (age), should be unstructured and centered at 0.

If the residuals look like they are grouped in one section of the plot, or follow a pattern, then the model is not a good fit (ie. looks quadratic - you would have a nonlinear association). If it looks like a sideways tornado, then errors are increasing with $x$, and this is non-constant variance.

The residuals are found in our lm object, ageFit, which also contains the inputs of our model; it is a list of 12 components. You'll see that $residuals is the second entry of our structured list.

We can use the broom package to get information out of linear model objects into the glorious data frame format that we know and love. This is done using the augment() function.

Now that we have a data frame we can plot our residuals (.resid) versus our fitted data (.fitted).
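A sketch of that workflow, assuming the broom and ggplot2 packages are installed (simulated stand-in data):

```r
library(broom)
library(ggplot2)

# Simulated stand-in for the cholesterol ~ age fit.
set.seed(10)
d <- data.frame(age = runif(200, 30, 70))
d$chol <- 160 + 0.3 * d$age + rnorm(200, sd = 15)
ageFit <- lm(chol ~ age, data = d)

# augment() returns a data frame with .fitted and .resid columns.
aug <- augment(ageFit)
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```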


6.2.1.1 Plotting residuals from fitted categorical variables

You can plot the fitted and residual values with a categorical variable, but it is sometimes difficult to view patterns. For example, here is what plotting the residuals for our model of cholesterol as a function of genotype would look like.

Instead, you can perform a statistical test of equal variance.

Bartlett's test can test whether or not population (group) variances are the same. We can see if variances are equal in our model of cholesterol as a function of sex by inputting the formula and dataset into the bartlett.test() function.
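A sketch of that call with the formula interface (simulated two-group data with equal spread standing in for the course dataset):

```r
# Simulated stand-in: two groups drawn with the same variance.
set.seed(11)
d <- data.frame(sex = factor(rep(c("male", "female"), each = 100)))
d$chol <- 175 + rnorm(200, sd = 18)

# Null hypothesis: group variances are equal.
bt <- bartlett.test(chol ~ sex, data = d)
bt
```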

The p-value tells us that the variances are not statistically different between our groups. Our assumption of equal variance is valid.


6.2.2 QQ-plots of residuals

QQ-plots (quantile-quantile) are a tool to answer the question: does our data plausibly come from the given (here, normal) distribution? The data are plotted against a theoretical distribution, and the points should fall on the straight line. Points that drift away from the line indicate departures from that distribution.

The stat_qq geom from ggplot2 allows us to plot our residuals along the y-axis in ascending order, and theoretical quantiles of a normal distribution along the x-axis. A straight line can be added to see where residuals fall with stat_qq_line.
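A sketch of the qq-plot of model residuals, assuming ggplot2 is installed (simulated stand-in data):

```r
library(ggplot2)

# Simulated stand-in: residuals from a well-behaved linear fit.
set.seed(12)
d <- data.frame(age = runif(200, 30, 70))
d$chol <- 160 + 0.3 * d$age + rnorm(200, sd = 15)
res <- resid(lm(chol ~ age, data = d))

# aes(sample = ...) feeds the residuals to stat_qq / stat_qq_line.
p <- ggplot(data.frame(res = res), aes(sample = res)) +
  stat_qq() +
  stat_qq_line()
p
```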

This looks pretty straight. We likely have normality of errors.

Let's try a less perfect example and look at the relationship between age and triglycerides (TG). Make a scatterplot of age and triglycerides with a linear fit to take a look at the data.

Write a linear model for triglyceride levels as a function of age. Use broom to get the output of the lm object into data frame format.

Plot the residuals against the fitted values. Does the variance look equal across the residuals?

Our residuals are increasing with increasing values of $x$ (.fitted).

What do the residuals look like in a qq-plot?

Our qq-plot points are deviating from the line, suggesting a poor fit for our model.


6.3.0 Challenge

For a) the anova model of the effect of the genetic factor APOE on cholesterol levels and b) the ancova model of age + APOE genotype on cholesterol levels: can you use any tools to assess whether the assumptions of your model are accurate?

The Bartlett test tells us that the variances are significantly different between the APOE genotype groups, so the assumption of equal variance is violated. We can also see this by plotting the residuals for the model with APOE genotype + age, whose pattern looks like an arrowhead. The qq-plot, however, looks reasonable.


6.4.0 Next Steps (or When Assumptions Fail)

The consequences of violating the assumptions for linear models depend, of course, on the assumption being violated. The worst offence is having non-linearity in your parameters, in which case you are using the wrong model.

Our last example had a case of non-constant variance (heteroscedasticity). This means that there is a mean-variance relationship (recall the tornado shape). In this case the parameter estimates are minimally impacted; however, the variance estimates are incorrect.

To account for this we can use:

  1. Data transformation
  2. Robust standard errors
  3. Use a different model that does not assume constant variance (glm)

Data transformation can solve some nonlinearity, unequal variance and non-normality problems when applied to the dependent variable, the independent variable, or both. However, interpreting the results of these transformations can be tricky.
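A sketch of one such transformation: when the spread of the response grows with its mean (multiplicative errors, simulated below), a log transform of the dependent variable can stabilize the variance.

```r
# Simulated heteroscedastic data: errors are multiplicative on y.
set.seed(13)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.2 * x + rnorm(200, sd = 0.3))

raw_fit <- lm(y ~ x)       # residual spread grows with x (tornado shape)
log_fit <- lm(log(y) ~ x)  # roughly constant residual spread
summary(log_fit)
```

The trade-off in interpretation: the coefficients of log_fit are on the log scale, so a one-unit increase in x multiplies the expected y by exp(slope) rather than adding a fixed amount.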

We corrected the non-constant variance issue, but it is harder to interpret our model. Exploring how to address these issues is beyond the scope of this appendix but some additional resources include:

  1. gee package, a generalized estimation equation solver similar to lm() except you include an additional 'id' variable to solve.
  2. glm(), a generalized linear model function that can use distributions other than the normal Gaussian.

CAGEF_new.png